Handling errors in an error-tolerant distributed computer system

ABSTRACT

Method of dealing with faults in a fault-tolerant distributed computer system, as well as such a system, in which a plurality of node computers (K1 . . . K4) are connected by means of communication channels (c11 . . . c42) and access to the channels takes place according to a cyclic time slicing method. Messages leaving the node computers (K1 . . . K4) are checked by independently formed guardians (GUA) which convert a message burdened with an SOS (“slightly off specification”) fault either into a correct message or into a message which can be recognised by all node computers as clearly incorrect.

[0001] The invention relates to a method of dealing with faults in a fault-tolerant distributed computer system with a plurality of node computers which are connected by means of communication channels, wherein each node computer has an autonomous communication control unit, access to the communication channels takes place according to a cyclic time slicing method and the correctness of messages leaving the node computers is checked by guardians.

[0002] Likewise the invention refers to a fault-tolerant distributed computer system with a plurality of node computers which are connected to each other by means of at least one distributor unit and communication channels, wherein each node computer has an autonomous communication control unit, access to the communication channels takes place according to a cyclic time slicing method and guardians are provided for the purpose of checking the messages leaving the node computers.

[0003] Safety-critical technical applications, that is to say applications where a fault can lead to a disaster, are increasingly controlled by distributed fault-tolerant real-time computer systems.

[0004] In a distributed fault-tolerant real-time computer system comprising a number of node computers and a real-time communication system, any single failure of a node computer is to be tolerated. At the core of such a computer architecture there is a fault-tolerant real-time communication system for the predictably fast and secure exchange of messages.

[0005] A communication protocol which fulfils these requirements is described in EP 0 658 257 corresponding to U.S. Pat. No. 5,694,542. The protocol has become known under the name “Time-Triggered Protocol/C (TTP/C)” and is also disclosed in Kopetz, H. (1997) Real-Time Systems, Design Principles for Distributed Embedded Applications; ISBN: 0-7923-9894-7, Boston, Kluwer Academic Publishers. It is based upon the known cyclic time slicing method (TDMA: time-division multiple access) with a priori fixed time slices. TTP/C uses a method of fault-tolerant clock synchronisation which is disclosed in U.S. Pat. No. 4,866,606.

[0006] TTP/C presupposes that the communication system supports a logical broadcast topology and that the node computers display fail-silent failure behaviour, i.e. either the node computers function correctly in the value domain and in the time domain or they are quiet. The prevention of faults in the time domain, that is to say the so-called “babbling idiot” faults, is achieved in TTP/C by an independent fault recognition unit, called a “bus guardian” there, which has an independent time basis and continuously checks the time behaviour of the node computer. In order to realise the fault tolerance several fail-silent node computers are brought together to form a fault-tolerant unit (FTU) and the communication system is replicated. So long as a node computer of an FTU and a replica of the communication system function, the services of the FTU are provided punctually in the time domain and value domain.

[0007] A logical broadcast topology of the communication can be physically constructed either through a distributed bus system, a distributed ring system or through a central distributor unit (e.g. a star coupler) with point-to-point connections to the node computers. If a distributed bus system or a distributed ring system is constructed each node computer must have its own bus guardian. If on the other hand a central distributor unit is used all guardians can be integrated into this distributor unit which due to the global observation of the behaviour of all nodes can effectively force regular sending behaviour in the time domain. This is described in the subsequently published WO 01/13230 A1.

[0008] In a distributed computer system faults which can lead to an inconsistent state of the system are particularly critical. As an example a so-called brake-by-wire application in a motor car is cited here wherein a central brake computer sends brake messages to four wheel computers in the wheels. If a brake message is correctly received by two wheel computers and the other two wheel computers do not receive the message an inconsistent state arises. If braking of two wheels which are on the same side of the vehicle takes place the vehicle can go out of control. The type of fault described here is referred to in the literature as a Byzantine fault (Kopetz, p. 60, p. 133). The fast recognition and correct handling of Byzantine faults is one of the most difficult problems in computing.

[0009] A sub-class of Byzantine faults is formed by the “slightly-off-specification” faults (in short SOS faults). An SOS fault can occur on the interface between analog technology and digital technology. In the present specialised area “digital signals” are understood to be logical signals but “analog signals” are understood to be all physical signals. The distinction between analog and digital technology is also to be understood here in this sense. In the realisation of a data transfer each logical bit can be represented on the line by a signal value (e.g. voltage from a specified voltage tolerance interval) during a specified interval of time. A correct sender must generate its analog signals within the specified tolerance intervals in order to ensure that all correct recipients also correctly interpret these signals. If a sender of a message generates a signal slightly (slightly-off-specification) outside of the specified interval (in the value domain, in the time domain or in both) the case can arise wherein a few recipients correctly interpret this signal while other recipients cannot interpret the signal correctly. We refer to such a broadcast message as SOS-false. Subsequently a Byzantine fault, as described above with reference to a brake system, can occur. Such a fault can be caused by a defective voltage supply, a defective clock or a component which has deteriorated through ageing. The transfer of a message to two communication channels cannot prevent SOS faults if the reason for the fault, e.g. a defective clock of the computer node which generates the bit sequence, affects both channels.
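The value-domain case of an SOS fault can be illustrated with a small sketch. It is not part of the described system: the receiver names, the decision thresholds and the voltages are hypothetical values chosen only to show how a marginal signal level splits the recipients of a broadcast into two groups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receiver:
    name: str
    threshold_v: float  # level at or above which this receiver reads a logical 1


def read_bit(receiver: Receiver, level_v: float) -> int:
    """Interpret one analog level as a logical bit, as this receiver sees it."""
    return 1 if level_v >= receiver.threshold_v else 0


# Hypothetical component tolerances: every threshold is slightly different.
RECEIVERS = [
    Receiver("wheel_front_left", 2.45),
    Receiver("wheel_front_right", 2.50),
    Receiver("wheel_rear_left", 2.55),
    Receiver("wheel_rear_right", 2.60),
]


def readings(level_v: float) -> dict:
    """Bit value read by each receiver for the same broadcast level."""
    return {r.name: read_bit(r, level_v) for r in RECEIVERS}


def is_consistent(level_v: float) -> bool:
    """True if all receivers interpret the level in the same way."""
    return len(set(readings(level_v).values())) == 1
```

A level well inside a tolerance interval, such as 3.3 V or 0.2 V in this sketch, is read identically by all four receivers. A level of 2.52 V lies between the individual thresholds, so two receivers read a 1 and two read a 0, which is the inconsistent view described above.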

[0010] It is a principle of safety engineering to recognise errors at the earliest possible point in time in order to be able to take counter-measures before subsequent faults cause further damage. This principle is complied with in the said TTP/C protocol (EP 0 658 257) in that SOS faults are consistently recognised by means of the so-called membership algorithm of the TTP/C protocol within a maximum of two TDMA rounds. As SOS faults are typically very rarely occurring transient faults, an existing prototype implementation of TTP/C assigns SOS faults to the class of virtually coincident multiple faults, which likewise occur very rarely, and handles them as such.

[0011] It is an object of the invention to facilitate toleration of faults of the SOS class in a distributed computer system through appropriate measures.

[0012] This object is achieved with a method of the type mentioned at the outset wherein according to the invention the independently formed guardians convert a message burdened with an SOS (“slightly off specification”) fault either into a correct message or into a message which can be recognised by all receiving node computers as clearly incorrect.

[0013] The object is also achieved with a fault-tolerant distributed computer system of the type indicated above wherein according to the invention the independently formed guardians are adapted to convert a message burdened with an SOS (“slightly off specification”) fault either into a correct message or into a message which can be recognised by all receiving node computers as clearly incorrect.

[0014] Thanks to the invention the fault category of “slightly off specification” (SOS) faults can also be tolerated in a time-controlled, distributed, fault-tolerant architecture for highly reliable real-time computer applications.

[0015] In an advantageous embodiment it is provided that each independent guardian, with the support of its independent time basis, checks whether the start of a message sent by the communication control unit of a node computer falls within the start time window of the message which is known a priori to the guardian, and immediately closes the corresponding communication channel if the start of the message lies outside of this time window, so that an incomplete message is produced which can be recognised by all receiving node computers as incorrect. In this way the occurrence of only slightly distorted messages which may be wrongly interpreted by the recipients as correct can be prevented.

[0016] Furthermore it is useful if a guardian regenerates the incoming physical signal of each message in the time domain and value domain, taking into consideration the relevant coding regulations and using its local time basis and its local power supply. Such an independent regeneration considerably increases the safety of the system.

[0017] Another advantageous embodiment of the invention provides that a guardian not receiving any messages generates no messages with correct CRC and correct length. This measure can also further increase the safety of the system.

[0018] Optimum control on the basis of the start time window provides that the start time window of a guardian begins by more than the precision of the system after the beginning of the start time window of a node computer and the start time window of a guardian ends by more than the precision before the end of the start time window of a node computer.

[0019] Additional advantages not only in relation to safety but also with regard to the realisation costs of the system ensue if the guardians are integrated into the distributor unit, of which there is at least one, and the distributor unit has an independent power supply and independent fault-tolerant distributed clock synchronisation.

[0020] The invention together with further advantages is described in greater detail below by reference to example embodiments which are illustrated in the drawings. The latter show:

[0021] FIG. 1 schematically a distributed computer system comprising four node computers which are connected to each other by means of two replicated central distributor units,

[0022] FIG. 2 a Fault Containment Unit formed by a node computer and two guardians and FIG. 3 the position of the start time window of a guardian and a node computer.

[0023] FIG. 1 shows a system of four node computers K1, K2, K3, K4 wherein each node computer forms an exchangeable unit and is connected with a point-to-point connection or communication channel c11 . . . c42 to one of two replicated central distributor units V1 or V2. Between each output of a node computer and each input of the distributor unit there is a guardian GUA which is either designed to be independent or can be integrated into the distributor unit. The principal function of a guardian or bus guardian is explained in Kopetz, p. 173. In order to be able to fulfil its function a guardian also requires, besides a controller, switches in order to open or lock channels. Two unidirectional communication channels v21, v12 between the distributor units V1 and V2 serve for the reciprocal monitoring and the information exchange of the central distributor units V1 and V2. As also ensues from Kopetz, e.g. p. 172-177, each node computer K1 . . . K4 has an autonomous controller CON or communication controller which is connected to the replicated communication channels, e.g. c11, c12. Indicated connections w1, w2 are dedicated communication channels. They lead to service computers w1, w2 which can monitor the parameters of the distributor units and the correct functions of the same.

[0024] FIG. 2 shows a node computer K1 with its communication controller CON and the communication channels c11, c21 to the other node computers or distributor units of the distributed computer system. The guardians GUA are provided as bus guardians here for the communication channels c11, c21 but they can be integrated into the two independent central distributor units V1, V2 according to FIG. 1. From a logical viewpoint the three sub-systems, node computer plus two guardians, form a unit which is referred to here as a “Fault Containment Unit” (FCU) and is indicated in this way in FIG. 2. As stated, this holds independently of whether the guardians GUA are physically integrated into the central distributor units or into the node computers.

[0025] Reference is now made to FIG. 3, in which the start time windows for the start of a message are shown. A distinction is made between the start time window T_(CON), of length T_(CON), of a node computer or its controller and the start time window T_(GUA) of a guardian. The invention provides that the time window T_(GUA) of a guardian is shorter than the time window T_(CON) of a node computer and that T_(GUA) is embedded in T_(CON) such that at each end of T_(GUA) there remains an interval t1 or t2 which is greater than the precision P of the system. The concept of precision is clarified e.g. in Kopetz, Chapter 3.1.3 “Precision and Accuracy”, p. 49 and 50.
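The relation between the two start time windows can be sketched as follows. This is a minimal illustration under stated assumptions, not the described implementation: the class `Window`, the functions `guardian_window` and `guardian_action`, the use of one common margin for both t1 and t2, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    start: float
    end: float

    def contains(self, t: float) -> bool:
        return self.start <= t <= self.end


def guardian_window(t_con: Window, precision: float, margin: float) -> Window:
    """Derive T_GUA from T_CON, leaving the same margin (t1 = t2) at both ends.

    The margin must exceed the precision P of the system.
    """
    if margin <= precision:
        raise ValueError("margin must be greater than the precision P")
    t_gua = Window(t_con.start + margin, t_con.end - margin)
    if t_gua.start >= t_gua.end:
        raise ValueError("T_CON is too short for the requested margin")
    return t_gua


def guardian_action(t_gua: Window, message_start: float) -> str:
    """Forward a message that starts inside T_GUA, otherwise lock the channel."""
    return "forward" if t_gua.contains(message_start) else "lock_channel"
```

With a hypothetical T_(CON) from 100.0 to 120.0, a precision of 2.0 and a margin of 3.0, the guardian window runs from 103.0 to 117.0. A message starting at 110.0 is forwarded, whereas one starting at 101.0 lies inside T_(CON) but outside T_(GUA) and causes the channel to be locked.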

[0026] We refer to an arbitrary fault of an active sub-system, e.g. of the node computer K1, as unconstrained active. Furthermore we refer to a fault of a passive sub-system, e.g. of a guardian or of a connection c11 or c22, as unconstrained passive if it is ensured through the construction of the passive sub-system that this sub-system cannot generate a bit sequence from itself, i.e. without input from an active sub-system, which can be interpreted by a recipient as a syntactically correct message. A message is syntactically correct if a CRC check does not indicate any faults, it has the correct length, corresponds to the coding regulations and arrives within the expected interval of time.
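The four conditions for syntactic correctness can be written down as a short check. The sketch below makes several assumptions that are not taken from the text: the frame layout (a fixed-length payload followed by a 4-byte checksum), the use of CRC-32 from Python's `zlib` module as a stand-in for the protocol's own CRC, and a boolean `coding_ok` flag in place of a real line-code check.

```python
import zlib

PAYLOAD_LEN = 8  # hypothetical fixed payload length in bytes
CRC_LEN = 4      # CRC-32 is used here only as a stand-in checksum


def build_frame(payload: bytes) -> bytes:
    """Append the checksum to a payload of the expected length."""
    if len(payload) != PAYLOAD_LEN:
        raise ValueError("payload has the wrong length")
    return payload + zlib.crc32(payload).to_bytes(CRC_LEN, "big")


def is_syntactically_correct(frame: bytes, arrival: float,
                             window_start: float, window_end: float,
                             coding_ok: bool = True) -> bool:
    """Check length, CRC, coding rules and arrival time of a received frame."""
    if len(frame) != PAYLOAD_LEN + CRC_LEN:
        return False
    payload, crc = frame[:PAYLOAD_LEN], frame[PAYLOAD_LEN:]
    if zlib.crc32(payload).to_bytes(CRC_LEN, "big") != crc:
        return False
    if not coding_ok:  # placeholder for a check of the coding regulations
        return False
    return window_start <= arrival <= window_end
```

A frame that is cut short, for instance because a guardian locked the channel during the transfer, fails the length check at every recipient, which is what makes it clearly incorrect rather than slightly off specification.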

[0027] If a passive sub-system does not have the knowledge of how to generate a correct CRC (has no access to the CRC generation algorithm) and of how long a correct message must be, the probability that statistical random processes (disturbances) produce a syntactically correct message is negligibly low.
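The order of magnitude of this probability can be estimated under simplifying assumptions that are not stated in the text: the disturbance produces independent, uniformly distributed bits, and the CRC has k check bits (k is a symbol introduced here for illustration). Of all bit sequences of a given length, a fraction of 2^{-k} carries a matching CRC, so that

```latex
P(\text{random sequence accepted})
  = P(\text{correct length}) \cdot 2^{-k}
  \le 2^{-k}
```

For example, k = 16 gives a bound of about 1.5 × 10^{-5} and k = 24 a bound of about 6 × 10^{-8} per disturbance, before the coding and timing conditions reduce it further.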

[0028] A Fault Containment Unit FCU can convert an unconstrained active fault of a node computer K1 or an unconstrained passive fault of one of the two guardians GUA into a fault which is not a Byzantine fault if the following assumptions are fulfilled:

[0029] (i) a correct node computer K1 sends the same syntactically correct message on both channels c11 and c12 and

[0030] (ii) a correct guardian GUA converts an SOS-false message from the node computer K1 either into a syntactically correct message or into a message which can be recognised as clearly incorrect by all recipients (non-SOS message) and

[0031] (iii) during the sending of a message a maximum of one of the indicated sub-systems is defective.

[0032] Due to the fault assumption (iii) only a single one of the three indicated sub-systems K1, GUA, GUA can be defective. If the node computer K1 is unconstrained defective both the guardians GUA and GUA are not defective and generate according to assumption (ii) non-SOS messages. If one of the two guardians GUA is unconstrained passive defective the node computer K1 generates a syntactically correct message and transfers this syntactically correct message to both guardians GUA (assumption i). The correct guardian GUA then transfers the message correctly to all recipients, i.e. node computers. Due to the reception logic and the self-reliance principle of the TTP/C protocol in this case all the correct recipients will select the correct message and classify the sending node computer as correct. In order to tolerate SOS faults no change in the TTP/C protocol is necessary.
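The case distinction above can be checked exhaustively on a very coarse model. The sketch below is an illustration under stated assumptions only: each message on a channel is reduced to one of three hypothetical states (`OK`, `SOS`, `INVALID`), an SOS message is accepted only by "lenient" receivers, and a receiver accepts the sender if at least one channel delivers a message it regards as correct. None of these names or simplifications come from the TTP/C protocol itself.

```python
from itertools import product

OK, SOS, INVALID = "ok", "sos", "invalid"


def correct_guardian_outputs(incoming: str) -> list:
    """Possible outputs of a correct guardian (assumption ii): never SOS."""
    if incoming == OK:
        return [OK]
    if incoming == SOS:
        return [OK, INVALID]
    return [INVALID]


# A defective guardian fed with a correct message may emit anything.
FAULTY_GUARDIAN_OUTPUTS = [OK, SOS, INVALID]


def perceives_ok(status: str, lenient: bool) -> bool:
    """An SOS message looks correct only to a lenient receiver."""
    return status == OK or (status == SOS and lenient)


def decisions(ch_a: str, ch_b: str, receivers: list) -> list:
    """Per receiver: is the sender accepted on at least one channel?"""
    return [perceives_ok(ch_a, r) or perceives_ok(ch_b, r) for r in receivers]


def single_fault_scenarios():
    """Channel states after the guardians under assumption (iii)."""
    # Defective node computer, both guardians correct.
    for node_a, node_b in product([OK, SOS, INVALID], repeat=2):
        for a in correct_guardian_outputs(node_a):
            for b in correct_guardian_outputs(node_b):
                yield a, b
    # Correct node computer (assumption i), exactly one defective guardian.
    for faulty in FAULTY_GUARDIAN_OUTPUTS:
        yield faulty, OK
        yield OK, faulty


def all_consistent(receivers: list) -> bool:
    """True if every single-fault scenario yields one common decision."""
    return all(len(set(decisions(a, b, receivers))) == 1
               for a, b in single_fault_scenarios())
```

For a mixed set of receivers such as `[True, False, True, False]`, `all_consistent` returns `True`: every scenario covered by assumption (iii) leads to one common decision. If an SOS message reached the receivers on both channels, which the assumptions exclude, `decisions(SOS, SOS, ...)` would split the receivers into two groups.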

[0033] A given message can be SOS-false for the following three reasons:

[0034] (i) the message has an SOS-fault in the value domain and/or

[0035] (ii) the message has an inner SOS fault in the time domain (e.g. timing fault within the code) and/or

[0036] (iii) the transfer of the message is begun slightly outside of the specified sending interval (see FIG. 3).

[0037] A correct guardian (GUA) converts these reasons for faults as follows into non-SOS faults:

[0038] (i) The output values of the message are regenerated by a guardian GUA with the independent voltage supply of the guardian.

[0039] (ii) The coding of the message is regenerated by a guardian GUA with the independent time basis of the bus guardian.

[0040] (iii) The guardian locks the channel as soon as it recognises that the transfer has begun outside of the specified interval of time T_(GUA). All recipients, i.e. node computers, therefore receive greatly distorted messages which are recognised as defective.
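Measures (i) and (ii) can be pictured as a re-quantisation of the incoming signal in the value domain and in the time domain. The following sketch is only an illustration: the nominal levels, the decision threshold, the representation of a message as sampled levels and edge times, and both function names are assumptions made for this example.

```python
NOMINAL_LOW_V = 0.0   # hypothetical nominal level for a logical 0
NOMINAL_HIGH_V = 5.0  # hypothetical nominal level for a logical 1
DECISION_V = 2.5      # the guardian's own decision threshold


def regenerate_levels(samples_v: list) -> list:
    """Value domain: replace every sampled level by a clean nominal level."""
    return [NOMINAL_HIGH_V if v >= DECISION_V else NOMINAL_LOW_V
            for v in samples_v]


def regenerate_edges(edge_times: list, frame_start: float,
                     bit_time: float) -> list:
    """Time domain: snap each edge to the nearest bit boundary of the
    guardian's own clock, counted from the start of the frame."""
    return [frame_start + round((t - frame_start) / bit_time) * bit_time
            for t in edge_times]
```

The guardian still has to decide a marginal input one way or the other, but it decides once for all recipients: whatever it emits lies well inside the tolerance intervals, so every recipient reads the same bit.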

[0041] A locking of the channel by a guardian GUA directly after the specified end of the transfer time of a message is generally not sufficient to prevent SOS-faults as it is not excluded that a message which is slightly distorted through the locking can be a trigger for an SOS-fault of a guardian GUA which is fault-free in itself. If both guardians slightly distort the message in the same way an SOS-fault can arise at system level.

[0042] Finally it is emphasised that this invention is not limited to the realisation described with four node computers but can be expanded as desired. It can be used not only with the TTP/C protocol but also with other time-controlled protocols.

1. Method of dealing with faults in a fault-tolerant distributed computer system with a plurality of node computers (K1 . . . K4) which are connected by means of communication channels (c11 . . . c42) and each node computer has an autonomous communication control unit (CON) wherein access to the communication channels takes place according to a cyclic time slicing method and the correctness of messages leaving the node computers is checked by guardians (GUA), characterised in that the independently formed guardians (GUA) convert a message burdened with an SOS (“slightly off specification”) fault either into a correct message or into a message which can be recognised by all receiving node computers (K1 . . . K4) as clearly incorrect.
2. Method according to claim 1 characterised in that each independent guardian (GUA) with the support of its independent time basis checks whether the start of a message sent by the communication control unit (CON) of a node computer (K1 . . . K4) falls within the start time window (T_(GUA)) of the message known a priori to the guardian (GUA) and immediately closes the corresponding communication channel (c11 . . . c42) if the message lies outside of this time window in order that an incomplete message which can be recognised by all receiving node computers as incorrect is formed.
3. Method according to claim 1 or 2 characterised in that a guardian (GUA) regenerates the incoming physical signal of each message in the time domain and value domain taking into consideration the relevant coding regulations and using its local time basis and its local power supply.
4. Method according to one of the claims 1 to 3 characterised in that a guardian (GUA) not receiving any messages does not generate any messages with correct CRC and correct length.
5. Method according to one of the claims 2 to 4 characterised in that the start time window (T_(GUA)) of a guardian (GUA) starts by more than the precision (P) of the system after the beginning of the start time window (T_(CON)) of a node computer (K1 . . . K4) and the start time window of a guardian ends by more than the precision before the end of the start time window of a node computer.
6. Fault-tolerant distributed computer system with a plurality of node computers (K1 . . . K4) which are connected to each other by at least one distributor unit (V1, V2) and communication channels (c11 . . . c42), each node computer has an autonomous communication control unit (CON), access to the communication channels takes place according to a cyclic time slicing method and guardians (GUA) are provided for the purpose of checking the messages leaving the node computers, characterised in that the independently formed guardians (GUA) are adapted to convert a message burdened with an SOS (“slightly off specification”) fault either into a correct message or into a message which can be recognised by all receiving node computers (K1 . . . K4) as clearly incorrect.
7. Computer system according to claim 6 characterised in that a guardian (GUA) has an independent time basis and is adapted to check whether the start of a message sent by the communication control unit (CON) of a node computer (K1 . . . K4) falls within the start time window (T_(GUA)) of the message known a priori to the guardian (GUA) as well as to immediately close the corresponding communication channel (c11 . . . c42) if the message lies outside of this time window in order that an incomplete message is formed which can be recognised by all receiving node computers as incorrect.
8. Computer system according to claim 6 or 7 characterised in that a guardian (GUA) is adapted to regenerate the incoming physical signal of each message in the time domain and value domain taking into consideration the relevant coding regulations and using its local time basis and its local power supply.
9. Computer system according to one of the claims 6 to 8 characterised in that a guardian (GUA) is adapted, in the event of its not receiving a message, to not generate any messages with correct CRC and correct length either.
10. Computer system according to one of the claims 6 to 9 characterised in that the beginning of the start time window (T_(CON)) of a node computer (K1 . . . K4) lies by more than the precision (P) of the system before the beginning of the start time window (T_(GUA)) of a guardian and the end of the start time window of a guardian lies by more than the precision before the end of the start time window of a node computer.
11. Computer system according to one of the claims 6 to 10 characterised in that the guardians (GUA) are integrated into the distributor unit (V1, V2), of which there is at least one, and the distributor unit has an independent power supply and independent fault-tolerant distributed clock synchronisation.